Libris Britannia 4

home *** CD-ROM | disk | FTP | other *** search

/ Libris Britannia 4 / science library(b).zip / science library(b) / MATH / MFLOAT10.ZIP / MFLOAT.DOC < prev next >

Wrap

Text File | 1993-04-28 | 71KB | 1,679 lines

MFLOAT Version 1.00 =================== A brief description of "MFLOAT": * "MFLOAT" handles arithmetic using high precision numbers of up to 72 decimal digits. * "MFLOAT" is optimized for maximum speed for high precision numbers. These subroutines are the fastest for the 8086 microprocessor known to us. If you find faster subroutines for a 8086, please send me a mail. * "MFLOAT" subroutines are much faster than a standard emulation of a coprocessor for the same accuracy. "MFLOAT" does not use the coprocessor. * "MFLOAT" is written in assembler for the 8086 microprocessor and compatibles. The calculation is 10 to 20 times faster than that of subroutines written in C or PASCAL. This is important for tedious calculations like an optimization, where you have problems with the accuracy. For this problems you can avoid the use of a super computer which you feed with a C or PASCAL program for high precision numbers. * All important functions of PASCAL and of a C library ("MATH.H") are included. This includes the transcedental functions like sin(x), atan(x), exp(x), log(x).... * Programs written in C or C++, can use MFLOAT simply by replacing the double data format with the "MFLOAT" data format. The operators and the functions are overloaded for C++ and so only little changes of the source code are necessary. (You need BORLAND C 3.1). * The subroutines are included in compiled form (object files) in a program. Therfore you can use them for PASCAL, C and C++. The subroutines are tested for BORLAND PASCAL 7.0, TURBO PASCAL 6.0 and BORLAND C 3.1. * This package is limited to about 72 decimal digits. You may license the source code. Then you can extend the precision for the algebraic calculations up to 10000 decimal digits (upper limit is due to the memory limit of one segment of the 8086) and for transcendental function up to 500 digits. * The internal data format is a binary floating point format. The length of the mantissa can be choosen by the user from one word to 15 words. (1 word = 16 bits). This correspondeds to about 5, 10, 14, 19, 24, 29, 34, 39, 43, 48, 53, 58, 63, 67 and 72 digits of a decimal number. 1 word represents 4.81648 digits of a decimal number in average. The exponent of two has the length of 16 bits. The largest representable number is about 7.07E+9863, the smallest is about 7.07E-9865. *************************************************************************** The package includes following files: READ.ME - general information MFLOAT.DOC - describtion of the subroutines and mfloat numbers MFLOATA.OBJ - subroutines written in assembler MFLOATB.OBJ - subroutines written in C MFLOAT.H - header file for C MFLOAT.CXX - file for C++ : overloading functions and operators PFLOAT.PAS - unit: declarations of the external subroutines PI.PAS - example in PASCAL, calculation of pi BESSEL.PAS - example in PASCAL, calculation of the bessel function POLAR.PAS - example in PASCAL, conversion cartesian to polar coordinates PI.C - example in C, calculation of pi BESSEL.C - example in C, calculation of the bessel function POLAR.C - example in C, conversion cartesian to polar coordinates PI.CPP - example in C++, calculation of pi BESSEL.CPP - example in C++, calculation of the bessel function POLAR.CPP - example in C++, conversion cartesian to polar coordinates PI.EXE - example program executable The source code option includes further files: MFLOATA.ASM - source code in assembler (basic subroutines) MFLOATB.C - source code in C (transcendental functions and extensions) MFLOATC.ASM - constants (include file of "MFLOATA.ASM") CALCCON.PAS - calculation of constants in "MFLOATC.ASM" in PASCAL source CALCCON.EXE - calculation of constants in "MFLOATC.ASM" Registration: see READ.ME The authors of MFLOAT are interested in a feedback from the users of "MFLOAT". If you have problems or you find some bugs (the subroutines are checked very thorougly, but the probability, that a software of large dimension has no bug is not large in reality), please describe them and send us a mail per internet. You help us to remove this errors of the subroutines. Many thanks for your help and much success with "MFLOAT". Kaufmann Friedrich & Mueller Walter Students at the Technical University of Graz surface mail: Kaufmann Friedrich Schuetzenhofgasse 22 A-8010 GRAZ AUSTRIA EUROPE email address: fkauf@fstgds06.tu-graz.ac.at **************************************************************************** ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ **************************************************************************** Multiprecision calculations with "MFLOAT" ========================================= "MFLOAT" is a package of subroutines written in assembler for a 8086 micro- processor or compatibles. It is made for users having problems with the accuracy of standard IEEE arithmetic. The precision of IEEE-754 standard floating point numbers is good enough for most technical and scientific applications. But sometimes a scientist or a mathematician has problems in his calculations because the algorithm or the mathematical problem is very ill-conditioned (for example finding the roots of a polynomial). As the floating point numbers are only appoximations of real or rational numbers, there is an additional error at every floating point operation. There is no method known to represent every real number exactly in a binary format and therefore an approximation is necessary. A practical method is to extend the mantissa of the floating point number, which results in less error at every operation. But there are limitations because of higher memory needs for every floating point number and increased execution time. This is why a scientist having numerical problems of this kind is looking for very fast subroutines for multiprecision floating point numbers. This package represents such a package of subroutines. It uses a binary data format because there are optimal instructions for the manipulation of binary numbers in the instruction set of the 8086. A binary data format has also the advantage, that the used memory is minimized. The only disadvantage is the higher effort that is necessary for the input and output of the numbers in ASCII-Format. The main goal is a minimal computing time for very accurate numbers. Therefore no time is wasted for rounding (every result is truncated) and the handling of abnormal numbers as NANs or denormals. The only special number is zero. For saving time, there is no stack overflow check. In case of an overflow or some other calculation error an error flag is set, which can be tested during or at the end of the calculation. Of course in such a case the numerical result is wrong. In case of an underflow the result is set to zero without any message. A multiprecision number is consists of an array of words (1 word = 2 byte = 16 bit). The first word represents the exponent to the base of two. It is represented with an offset of 8000 Hex (= 32768) and ranges from 1 to FFFF Hex (= 65535). The special value 0 is used to represent the floating point number zero. In this case the rest of the array is arbitrary. The second word of the array represents the first word of mantissa. The mantissa is normalized for every floating point number and therefore the most significant bit of the first word of mantissa is always set. As there is no information in this bit, it is removed and at its place the sign bit is stored. The sign bit is one if the floating point number is negative, and zero if the floating point number is positive. The further words of the multiprecision number are words of mantissa. The length of mantissa ranging from one to a constant (15 in this release) can be choosen by the user. Example : 8002 490F Hex represents 3.141540528.. Every word of mantissa represents in average 4.8165 (= 16 * log10(2)) digits of a decimal number. Therefore you need 15 words of mantissa for a accuracy of 72 decimal digits and the whole number uses 16 words or 32 bytes. If you have the source code, you can increase the accuracy. The upper limit of the accuracy is given by the used stack of the subroutines, which is about 3 times the memory of a "MFLOAT" number for the arithmetic operations addition, subtraction, multiplication and division. The stack is limited to one segment of the 8086 which is 65536 byte. A further limit for transcendental functions is 125 mantissa words (602 digits of a decimal number) for the series of the sine. If you extend this limit, you will get an error, for an argument near pi/4. "MFLOATA.OBJ" includes 19 "MFLOAT" constants. This constants are stored in the code segment, which is limited to 65536 bytes. There is a very large range for the exponent. The greatest number, that can by represented is about +/-7.07E+9863, the smallest is about +/-7.07E-9865. Now lets take a look inside the subroutines. The coprocessor 8087 or a compatible is not used because the IEEE format does not match the data format. There are subroutines for the algebraic operations addition, subtraction, multiplication and division. Addition and subtraction are done by the same subroutine. The time for an addition (or subtraction) is proportional to the actual number of mantissa- words. The implementation of the multiplication uses the instruction "MUL". which is much faster than shifted additions. The multiplication result is two times as large as its operands but only the half of the accuracy is needed. Therfore its a waste of time to calculate the second part of the mantissa. If exactly half the number of mantissawords are calculated and the rest is neglected, there is an accumulation of truncating errors, which can be very large. Therfore an additional extra word is added to the result during calculation, but for the final result this extra word is ignored. By this way half of the calculation time is saved and only small additional errors occure. The calculation time increases quadratic with the number of mantissawords. The division is implemented using an algorithm similar to hand calculation. First an estimate of the first word of the result is made and the rest is calculated. If the rest is negativ, the estimation is corrected until it is ok. Using the rest the next word of mantissa is estimated and eventually corrected. This goes on until the whole mantissa of the result is calculated. For saving calculation time, the rest is truncated at every step by one word. Only at the first step an extra word is added (no truncation). Because of this method half of the calculation time is saved (same priciple as by the multiplication). The calculation time increases quadratic with the number of mantissawords. The first transcendental function is the square root. There is a hand calculation method known similar to a division. The implemented algorithm uses this method. For saving calcultion time the same tricks are used as in the implementation of multiplication and division. The calculation time is smaller than those of a division. There are also iterative methods known to calculate the square root. These methods are much slower than the implemented method. (3 - 5 times slower). Further trancendental functions are the trigonometric, hyperbolic functions, the exponential function and their invers functions. This functions can be derived from four series: sin(x) = x - x^3/6 + x^5/120 - ..... sinh(x) = x + x^3/6 + x^5/120 + ..... arctan(x) = x - x^3/3 + x^5/5 - ..... |x| < 1 artanh(x) = x + x^3/3 + x^5/5 + ..... |x| < 1 These are the basic functions for further calculations. All other functions can be derived from them. Here are some examples: cos(x) = sin(x + pi/2) tan(x) = sin(x) / cos(x) cot(x) = cos(x) / sin(x) cosh(x) = sqrt(1 + sqr(sinh(x)) tanh(x) = sinh(x) / cosh(x) coth(x) = cosh(x) / sinh(x) exp(x) = sinh(x) + cosh(x) arcsin(x) = arctan(x / sqrt(1 - sqr(x))) arccos(x) = pi/2 - arcsin(x) arccot(x) = pi/2 - arctan(x) arsinh(x) = artanh(x / sqrt(1 + sqr(x)) = ln(x + sqrt(1 + sqr(x))) arcosh(x) = artanh(sqrt(sqr(x) - 1) / x) = ln(x + sqrt(sqr(x) - 1)) arcoth(x) = artanh(1/x) ln(x) = 2 * artanh((x - 1) / (x + 1)) Some identities are used to save calculation time: sin(x + 2 * pi) = sin(x) exp(x + ln(2)) = 2 * exp(x) .... and so on. Most of the time is used for the calculation of the series. Therefore it is important to write an optimal assembler subroutine for the whole series. This series could also be programmed directly by the former subroutines but using a specialized assembler subroutine the calculation time can be reduced to one third at even higher accuracy. The calculation time for a series increases with the third power of the number of mantissawords. The subroutines are testet using the following compilers: Turbo Pascal 6.0, Borland Pascal 7.0, Borland C/C++ 3.1 You may use the library in combination with other compilers but in this case you may have to adabt the header files. Please note, that pascal calling conventions (upercase only, argument order, no underlines, ...) is implemented. All routines are 'far' with 'far' data pointers. =========================================================================== *************************************************************************** +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ *************************************************************************** =========================================================================== Using the subroutines for BORLAND C 3.1 : ========================================= To make the functions available you have to include MFLOAT.H in your source and to specify MFLOATA.OBJ and MFLOATB.OBJ in your project file. #define mfloatwords 16 /* maximum precision */ typedef int mfloata[mfloatwords];/* mfloat array */ #define mfloat_ int far * /* pointer to array */ #define mfloat mfloata /* for use in 'C' programs */ (declared in "MFLOAT.H") This type declaration for multiprecision floating point numbers sets a maximal precision of 15 mantissawords or about 72 decimal digits. The maximum precision is an assembly time constant. In this release it is fixed to 15 mantissawords. It is possible to change the type declaration to reduce the number of mantissawords. This leads to a saving in memory needs and to higher execution speed at the cost of precision. It is also possible to adjust the precision dynamically using the function 'setmantissawords'. The programmer must take care, that the stack is large enough, because there is no check of a stack overflow. **************************************************************************** ++++++++++++++++++++ mfloat basic functions ++++++++++++++++++++++++++ **************************************************************************** void far pascal setmantissawords(int number); This procedure sets the length of the mantissa of mfloat numbers. The number of word must be in the range of [1..15], otherwise is is bounded to these values. If the number of mantissawords is increased and the numbers of former calculations are used further, the additional words are not initialized to zero, their values are not known. The error of the number due to this further mantissa words is of course very small and is in the range of the accuracy of the former calculation. If it is important, that this further mantissa words are zero, the user has to initialize them (see data format). If the number of mantissa words is reduced, every number can be seen as truncated, although nothing is changed at this numbers. The number of mantissa words refers only to calculations. **************************************************************************** int far pascal getmantissawords(void); The function returns the actual number of mantissa words. **************************************************************************** void far pascal reseterror(void); The error flag is reset. **************************************************************************** char far pascal geterror(void); The error flag is returned. If there ocures a runtime error during the calculation, the error flag is set to true (=1). **************************************************************************** mfloat_ far pascal equm(mfloat_ a, /* a <-- b */ const mfloat_ b); The value of b is copied to variable a with the defined accuracy. The value of b is not changed. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal addm(mfloat_ a, /* a <-- a + b */ const mfloat_ b); The value of b is added to the value a with the defined accuracy. The result is variable a. The value of b is not changed. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal subm(mfloat_ a, /* a <-- a - b */ const mfloat_ b); The value of b is subtracted from the value a with the defined accuracy. The result is variable a. The value of b is not changed. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal multm(mfloat_ a, /* a <-- a * b */ const mfloat_ b); The value of a is multiplied by value b with the defined accuracy. The result is variable a. The value of b is not changed. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal divm(mfloat_ a, /* a <-- a / b */ const mfloat_ b); The value of a is divided by the value b with the defined accuracy. The result is variable a. The value of b is not changed. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal multi(mfloat_ a, /* a <-- a * b */ int b); The value of a is multiplied by the integer value b with the defined accuracy. The result is variable a. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal divi(mfloat_ a, /* a <-- a / b */ int b); The value of a is divided by the integer value b with the defined accuracy. The result is variable a. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal inversm(mfloat_ a); /* a <-- 1 / a */ The number one is divided by the number a with the defined accuracy. The result is variable a. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal negm(mfloat_ a); /* a <-- - a */ This procedure changes the sign of the number of variable a. The return value is a pointer to a. **************************************************************************** char far pascal eqzero(const mfloat_ a); /* eqzero <-- (a == 0) */ This function checks, if the multiprecision number a is zero. The value of a is not changed. The return value is 0 or 1. **************************************************************************** char far pascal gtzero(const mfloat_ a); /* gtzero <-- (a > 0) */ This function checks, if the multiprecision number a is greater than zero. The value of a is not changed. The return value is 0 or 1. **************************************************************************** char far pascal gezero(const mfloat_ a); /* gezero <-- (a >= 0) */ This function checks, if the multiprecision number a is greater than zero or equal zero. The value of a is not changed. The return value is 0 or 1. **************************************************************************** char far pascal gtm(const mfloat_ a, /* gtm <-- (a > b) */ const mfloat_ b); This function checks, if the multiprecision number a is greater than the multiprecision number b. The values of a and b are not changed. The return value is 0 or 1. **************************************************************************** char far pascal eqm(const mfloat_ a, /* eqm <-- (a == b) */ const mfloat_ b); This function checks, if the multiprecision number a is equal to the multiprecision number b. The values of a and b are not changed. The return value is 0 or 1. **************************************************************************** mfloat_ far pascal getzerom(mfloat_ a); /* a <-- 0 */ This procedure copies the value zero = 0.0000.. to the variable a. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal getonem(mfloat_ a); /* a <-- 1 */ This procedure copies the value one = 1.0000.. to the variable a. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal getpim(mfloat_ a); /* a <-- pi */ This procedure copies the value pi = 3.14159.. to the variable a. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal getln2m(mfloat_ a); /* a <-- ln(2) */ This procedure copies the value ln(2) = 0.69314.. to the variable a. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal getln10m(mfloat_ a); /* a <-- ln(10) */ This procedure copies the value ln(10) = 2.3025.. to the variable a. The return value is a pointer to a. **************************************************************************** int far pascal strtomf(mfloat_ a, /* a <-- b */ const char far * b); This procedure converts the string b to the multiprecision number a. If the conversion is correct, the integer zero is returned. Otherwise the number of the character, where the first error occurs, is retured. For conversion the accuracy of the calculation is set to maximum and afterwards is set back to the original value. This function is not optimized for speed, but for accuracy, because it is not used very often normally. **************************************************************************** char far * far pascal mftoa(char far * str, /* str <-- a */ const mfloat_ a, int len); This procedure converts the multiprecision number a to a string str with the number of ASCII - characters, which is defined in len. If the number of characters is too less, the string is set to stars '*'. The format used for this conversion is '.32767F' (see mftostr). For conversion the accuracy of the calculation is set to maximum and afterwards is set back to the original value. This function is not optimized for speed, but for accuracy, because it is not used very often normally. The return value is a pointer to the character array a. **************************************************************************** char far * far pascal mftostr(char far * str, /* str <-- a */ const mfloat_ a, int len, const char far * format); This procedure converts the multiprecision number a to a string str with a maximum number of ASCII - characters, which is defined in len. If the number of characters is too less, the string is set to stars '*'. The characters in the format - string have following meaning: '.' the point is shown ervery time, not significant zeros behind the point are shown 'E' scientific format with an exponent of ten 'F' fixpoint format, if the number is too large, a scientific format with an exponent of ten is used. 'G' fixpoint format, if the number is too large, or is less than 0.0001, a scientific format is used. '1..' an integer, which declares the number of digits behind the point for fixpoint and scientific format. The order of the characters (but not the digits for the precision) in the format string is arbitrary. It is also allowed to use lower case characters for the format, but in this case the letter for the exponent of ten is also the lower case letter 'e'. For conversion the accuracy of the calculation is set to maximum and afterwards is set back to the original value. This function is not optimized for speed, but for accuracy, because it is not used very often normally. The return value is a pointer to the character array a. **************************************************************************** double far pascal mftod(const mfloat_ b); /* mftod <-- b */ This function converts a multiprecision number to a double number. The value of b is not changed. **************************************************************************** long double far pascal mftold(const mfloat_ b); /* mftold <-- b */ This function converts a multiprecision number to a long double number. The value of b is not changed. **************************************************************************** mfloat_ far pascal dtomf(mfloat_ a, /* a <-- b */ double b); This procedure converts a double precision number to a multiprecision number. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal ldtomf(mfloat_ a, /* a <-- b */ long double b); This procedure converts a long double number to a multiprecision number. The return value is a pointer to a. **************************************************************************** +++++++++++++++ standart functions (Borland C: MATH.H) +++++++++++++++ **************************************************************************** mfloat_ far pascal acosm(mfloat_ a); /* a <-- acos(a) */ This procedure calculates the invers cos(x) function of the argument a. The result is the variable a. -1 <= a <= 1 range of the argument, otherwise the error flag is set 0 <= a <= pi range of the result The return value is a pointer to a. **************************************************************************** mfloat_ far pascal asinm(mfloat_ a); /* a <-- asin(a) */ This procedure calculates the invers sin(x) function of the argument a. The result is the variable a. -1 <= a <= 1 range of the argument, otherwise the error flag is set -pi/2 <= a <= pi/2 range of the result The return value is a pointer to a. **************************************************************************** mfloat_ far pascal atanm(mfloat_ a); /* a <-- atan(a) */ This procedure calculates the invers tan(x) function of the argument a. The result is the variable a. -pi/2 < a < pi/2 range of the result The return value is a pointer to a. **************************************************************************** mfloat_ far pascal atan2m(mfloat_ a, /* a <-- atan2(a,b) */ const mfloat_ b); This function calulates the extended arctan function phi = arctan(a/b). If a is zero or less than zero, the correct value phi is retured, which is in the intervall (-pi, pi]: a = r * sin(phi) b = r * cos(phi) r = sqrt(sqr(a) + sqr(b)) >= 0 If a = 0 and b = 0, the value phi is not unique. The error flag is set and the number zero is returned. The return variable is the variable a. Variable b is not changed. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal ceilm(mfloat_ a); /* a <-- ceil(a) */ The least integer number, which is greater than or equal the mfloat number a, is calculated and returned to the variable a. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal cosm(mfloat_ a); /* a <-- cos(a) */ This procedure calculates the cos(x) function of the argument a. The result is the variable a. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal coshm(mfloat_ a); /* a <-- cosh(a) */ This procedure calculates the cosh(x) function of the argument a. The result is the variable a. -22712 < a < 22712 range of the argument without an overflow a >= 1 range of the result The return value is a pointer to a. **************************************************************************** mfloat_ far pascal expm(mfloat_ a); /* a <-- exp(a) */ This procedure calculates exponential function exp(x) of the argument a. The result is the variable a. a < 22712 range of the argument without an overflow a < -22712 there is an underflow result = 0 a >= 0 range of the result The return value is a pointer to a. **************************************************************************** mfloat_ far pascal fabsm(mfloat_ a); /* a <-- fabs(a) */ This procedure returns the absolute value of the number a. The result is the variable a. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal floorm(mfloat_ a); /* a <-- floor(a) */ The greatest integer number, which is less than or equal the mfloat number a, is calculated and returned to the variable a. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal fmodm(mfloat_ a, /* a <-- fmod(a,b) */ const mfloat_ b); This is a floating point modul function. The number a is divided by b, so that the result is an integer. The rest of this division is returned by the variable a. The variable b is not changed. a = k * b + c; k is an integer; 0 <= c / b < 1; c is assigned to a If the variable b is zero, the error flag is set. For negative b, the result is negativ or zero. For very large k, which can not be represented as an mfloat number exactly, the result is zero. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal frexpm(mfloat_ a, /* a <-- frexp(a,b) */ int * b); The mfloat number a is splitted in the exponent and in the mantissa: a = c * (2**b); 0.5 <= |c| < 1 if a <> 0; c is assigned to a This procedure returns two results. IF the mfloat number is zero, it remains zero and b is set to zero. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal hypotm(mfloat_ a, /* a <-- hypot(a,b) */ const mfloat_ b); This function is used for the calculation of the absolute value of a complex number: c = sqrt(sqr(a) + sqr(b)); c is assigned to a; b remains unchanged. |a + i*b| = hypot(a,b); This procedure calculates the correct result also if sqr(a) can not be represented as a mfloat number, because it is too great or too small. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal ldexpm(mfloat_ a, /* a <-- ldexp(a,b) */ int b); The mfloat number a is multiplied by the power of two 2**b. a <-- a * (2**b); The calculation is very fast because only the exponent of two is manipulated. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal logm(mfloat_ a); /* a <-- log(a) */ This procedure calculates the natural logarithm ln(x) of the argument a. The result is the variable a. a > 0 range of the argument -22712 <= a <= 22712 range of the result The return value is a pointer to a. **************************************************************************** mfloat_ far pascal log10m(mfloat_ a); /* a <-- log10(a) */ This procedure calculates the decade logarithm log(x) of the argument a. The result is the variable a. a > 0 range of the argument -9864 <= a <= 9863 range of the result The return value is a pointer to a. **************************************************************************** mfloat_ far pascal modfm(mfloat_ a, /* a,b <-- modf(a) */ mfloat_ b); This procedure splits the number a in an integer part and in an fractional part: a = k + b; k is an integer; 0 <= b < 1; k is assigned to a The return value is a pointer to a. **************************************************************************** mfloat_ far pascal powm(mfloat_ a, /* a <-- pow(a,b) */ const mfloat_ b); This procedure calculates the power of the basis a by the exponent b. The result is assigned to the variable a, the variable b is not changed. If the exponent is an integer, the calculation is made by repeated multiplications. The number of multiplications is proportional to log(|b|). If b has a fractional, the calculation is made by logarithm. In this case, the base a must be positiv, otherwise the error flag is set, because the result is not real. If the number b is very large so that no fractional can be represented by the mfloat format and the base is negativ, the error flag is set too. If a = 0 and b <= 0 the result is not defined and the error flag is set. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal pow10m(mfloat_ a, /* a <-- 10**b */ int b); The power of the basis ten is build with the integer exponent b. The result is assigned to a. a <-- 10**b if b < -9864 then a = 0 if b > 9863 then there is an overflow -> error flag is set The return value is a pointer to a. **************************************************************************** mfloat_ far pascal sinm(mfloat_ a); /* a <-- sin(a) */ This procedure calculates the sin(x) function of the argument a. The result is the variable a. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal sinhm(mfloat_ a); /* a <-- sinh(a) */ This procedure calculates the sinh(x) function of the argument a. The result is the variable a. -22712 < a < 22712 range of the argument without an overflow The return value is a pointer to a. **************************************************************************** mfloat_ far pascal sqrtm(mfloat_ a); /* a <-- sqrt(a) */ This procedure calculates the square root of the number a. The result is variable a. If a is negativ, the error flag is set. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal tanm(mfloat_ a); /* a <-- tan(a) */ This procedure calculates the tan(x) function of the argument a. The result is the variable a. It is faster than the calculation of sin(a) / cos(a). The return value is a pointer to a. **************************************************************************** mfloat_ far pascal tanhm(mfloat_ a); /* a <-- tanh(a) */ This procedure calculates the tanh(x) function of the argument a. The result is the variable a. The return value is a pointer to a. **************************************************************************** ++++++++++++++++++ extended standart functions ++++++++++++++++++++++++ **************************************************************************** mfloat_ far pascal acoshm(mfloat_ a); /* a <-- acosh(a) */ This procedure calculates the invers cosh(x) function of the argument a. The result is the variable a. 1 <= a range of the argument 0 <= a < 22712 range of the result The return value is a pointer to a. **************************************************************************** mfloat_ far pascal acotm(mfloat_ a); /* a <-- acot(a) */ This procedure calculates the invers cot(x) function of the argument a. The result is the variable a. 0 < a < pi range of the result The return value is a pointer to a. **************************************************************************** mfloat_ far pascal acothm(mfloat_ a); /* a <-- acoth(a) */ This procedure calculates the invers coth(x) function of the argument a. The result is the variable a. -1 > a or 1 < a range of the argument The return value is a pointer to a. **************************************************************************** mfloat_ far pascal asinhm(mfloat_ a); /* a <-- asinh(a) */ This procedure calculates the invers sinh(x) function of the argument a. The result is the variable a. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal atanhm(mfloat_ a); /* a <-- atanh(a) */ This procedure calculates the invers tanh(x) function of the argument a. The result is the variable a. -1 < a < 1 range of the argument The return value is a pointer to a. **************************************************************************** mfloat_ far pascal cossinm(mfloat_ a, /* a <-- cos(a), */ mfloat_ b); /* b <-- sin(a) */ This procedure calculates the sin(x) and cos(x) function by only one evulation of the series for the sine function. This procedure is therefore faster than two individual calculations. It is recommended for example for the calculation of the exponential function of complex arguments. The return value is a pointer to a. **************************************************************************** mfloat_ far pascal cotm(mfloat_ a); /* a <-- cot(a) */ This procedure calculates the cot(x) function of the argument a. The result is the variable a. It is faster than the calculation of cos(a) / sin(a). The return value is a pointer to a. **************************************************************************** mfloat_ far pascal cothm(mfloat_ a); /* a <-- coth(a) */ This procedure calculates the coth(x) function of the argument a. The result is the variable a. range of the result: a >=1 or a <= -1 The return value is a pointer to a. **************************************************************************** mfloat_ far pascal exp10m(mfloat_ a); /* a <-- 10 ** a */ This procedure calculates the power of 10 of the argument a. The result is the variable a. a < 9863 range of the argument without an overflow a < -9864 there is an underflow, result = 0 a >= 0 range of the result The return value is a pointer to a. **************************************************************************** mfloat_ far pascal sqrm(mfloat_ a); /* a <-- sqr(a) */ This procedure calculates the sqare of a. The result is variabel a. sqr(a) = a * a The return value is a pointer to a. **************************************************************************** mfloat_ far pascal truncm(mfloat_ a); /* a <-- trunc(a) */ This procedure returns the integer part of the number a (truncation). The result is variable a. The sign has no influence and remains unchanged. The return value is a pointer to a. =========================================================================== *************************************************************************** +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ *************************************************************************** =========================================================================== Using the subroutines for Borland C++ 3.1 : =========================================== Using the mfloat library with C++ all of the mathematical operators are overloaded to allow 'normal' calculations with the mfloat class. To make the functions available you have to include MFLOAT.CXX (includes MFLOAT.H) in your source and to specify MFLOATA.OBJ and MFLOATB.OBJ in your project file. All operators available for the type 'double' are also defined for 'mfloat', So it is very easy to convert a program using double variables to use the mfloat library. Please note that it is not allowed to assign 'mfloat' variables to other numeric types. To solve this problem please use 'mftod' and 'mftold' (see the example programs!). overloaded operators: = ++ (prefix, suffix) -- (prefix, suffix) + (unary, binary) - (unary, binary) * / += -= *= /= < > <= >= == != << (output to stream) >> (input from stream) The following functions are overloaded mfloat equivalents of the functions defined in 'math.h': acos, asin, atan, atan2, ceil, cos, cosh, exp, fabs, floor, fmod, frexp, hypot, ldexp, log, log10, modf, pow, sin, sinh, sqrt, tan, tanh. Instead of 'atof' use 'atofm' (overloading problem). Instead of 'pow10' use 'pow10m' (overloading problem). The following functions are extensions to math.h: acosh, asinh, atanh, acoth, cot, coth, exp10, sqr, trunc. Further functions are available: void pascal setmantissawords(int number); int pascal getmantissawords(void); void pascal reseterror(void); char pascal geterror(void); int pascal strtomf(mfloat &a, const char *s); char * pascal mftoa(char *str, const mfloat &a, int len); char * pascal mftostr(char *str, const mfloat &a, int len, const char * format); double pascal mftod(const mfloat &b); long double pascal mftold(const mfloat &b); All functions are mapped to the functions defined in MFLOAT.H. So please take a look in description for using the functions with Borland C above. =========================================================================== *************************************************************************** +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ *************************************************************************** =========================================================================== Using the subroutines for TURBO PASCAL 6.0, 7.0 or BORLAND PASCAL 7.0 : ======================================================================= CONST MfloatWords = 16; TYPE mfloat = ARRAY[0..MfloatWords-1] OF integer; (declared in "PFLOAT.PAS") This type declaration for multiprecision floating point numbers sets a maximal precision of 15 mantissawords or about 72 decimal digits. The maximum precision is an assembly time constant. In this release it is fixed to 15 mantissawords. It is possible to change the type declaration to reduce the number of mantissawords. This leads to a saving in memory needs and to higher execution speed at the cost of precision. It is also possible to adjust the precision dynamically using the function 'setmantissawords'. The programmer must take care, that the stack is large enough, because there is no check of a stack overflow. **************************************************************************** ++++++++++++++++++++ mfloat basic functions ++++++++++++++++++++++++++ **************************************************************************** void far pascal setmantissawords(int number); **************************************************************************** ++++++++++++++++++++ mfloat basic functions ++++++++++++++++++++++++++ **************************************************************************** PROCEDURE Setmantissawords(number : integer); This procedure sets the length of the mantissa of mfloat numbers. The number of word must be in the range of [1..15], otherwise is is bounded to these values. If the number of mantissawords is increased and the numbers of former calculations are used further, the additional words are not initialized to zero, their values are not known. The error of the number due to this further mantissa words is of course very small and is in the range of the accuracy of the former calculation. If it is important, that this further mantissa words are zero, the user has to initialize them (see data format). If the number of mantissa words is reduced, every number can be seen as truncated, although nothing is changed at this numbers. The number of mantissa words refers only to calculations. **************************************************************************** FUNCTION GetMantissawords : integer; The function returns the actual number of mantissa words. **************************************************************************** PROCEDURE ResetError; The error flag is reset. **************************************************************************** FUNCTION GetError : boolean; The error flag is returned. If there ocures a runtime error during the calculation, the error flag is set to true. **************************************************************************** PROCEDURE equm(VAR a, b : mfloat); { *** a <-- b *** } The value of b is copied to variable a with the defined accuracy. The value of b is not changed. **************************************************************************** PROCEDURE addm(VAR a, b : mfloat); { *** a <-- a + b *** } The value of b is added to the value a with the defined accuracy. The result is variable a. The value of b is not changed. **************************************************************************** PROCEDURE subm(VAR a, b : mfloat); { *** a <-- a - b *** } The value of b is subtracted from the value a with the defined accuracy. The result is variable a. The value of b is not changed. **************************************************************************** PROCEDURE multm(VAR a, b : mfloat); { *** a <-- a * b *** } The value of a is multiplied by value b with the defined accuracy. The result is variable a. The value of b is not changed. **************************************************************************** PROCEDURE divm(VAR a, b : mfloat); { *** a <-- a / b *** } The value of a is divided by the value b with the defined accuracy. The result is variable a. The value of b is not changed. **************************************************************************** PROCEDURE multi(VAR a : mfloat; b : integer); { *** a <-- a * b *** } The value of a is multiplied by the integer value b with the defined accuracy. The result is variable a. **************************************************************************** PROCEDURE divi(VAR a : mfloat; b : integer); { *** a <-- a / b *** } The value of a is divided by the integer value b with the defined accuracy. The result is variable a. **************************************************************************** PROCEDURE inversm(VAR a : mfloat); { *** a <-- 1 / a *** } The number one is divided by the number a with the defined accuracy. The result is variable a. **************************************************************************** PROCEDURE negm(VAR a : mfloat); { *** a <-- - a *** } This procedure changes the sign of the number of variable a. **************************************************************************** FUNCTION eqZero(VAR a : mfloat) : boolean; { *** eqZero <-- a = 0 *** } This function checks, if the multiprecision number a zero. The value of a is not changed. **************************************************************************** FUNCTION gtZero(VAR a : mfloat) : boolean; { *** gtZero <-- a > 0 *** } This function checks, if the multiprecision number a is greater than zero. The value of a is not changed. **************************************************************************** FUNCTION geZero(VAR a : mfloat) : boolean; { *** geZero <-- a >= 0 *** } This function checks, if the multiprecision number a is greater than zero or equal zero. The value of a is not changed. **************************************************************************** FUNCTION gtm(VAR a, b : mfloat) : boolean; { *** gtm <-- a > b *** } This function checks, if the multiprecision number a is greater than the multiprecision number b. The values of a and b are not changed. **************************************************************************** FUNCTION eqm(VAR a, b : mfloat) : boolean; { *** eqm <-- a = b *** } This function checks, if the multiprecision number a is equal to the multiprecision number b. The values of a and b are not changed. **************************************************************************** PROCEDURE GetZerom(VAR a : mfloat); { *** a <- 0 *** } This procedure copies the value zero = 0.0000.. to the variable a. **************************************************************************** PROCEDURE GetOnem(VAR a : mfloat); { *** a <- 1 *** } This procedure copies the value one = 1.0000.. to the variable a. **************************************************************************** PROCEDURE GetPim(VAR a : mfloat); { *** a <- pi *** } This procedure copies the value pi = 3.14159.. to the variable a. **************************************************************************** PROCEDURE GetLn2m(VAR a : mfloat); { *** a <- ln(2) *** } This procedure copies the value ln(2) = 0.69314.. to the variable a. **************************************************************************** PROCEDURE GetLn10m(VAR a : mfloat); { *** a <- ln(10) *** } This procedure copies the value ln(10) = 2.3025.. to the variable a. **************************************************************************** FUNCTION strtomf(VAR a : mfloat; { *** a <-- string *** } b : string) : integer; This procedure converts the string b to the multiprecision number a. If the conversion is correct, the integer zero is returned. Otherwise the number of the character, where the first error occurs, is retured. For conversion the accuracy of the calculation is set to maximum and afterwards is set back to the original value. This function is not optimized for speed, but for accuracy, because it is not used very often normally. **************************************************************************** FUNCTION mftoa(VAR a : mfloat; { *** string <-- a *** } len : integer) : string; This procedure converts the multiprecision number a to a string str with the number of ASCII - characters, which is defined in len. If the number of characters is too less, the string is set to stars '*'. The format used for this conversion is '.32767F' (see FUNCTION mftostr). For conversion the accuracy of the calculation is set to maximum and afterwards is set back to the original value. This function is not optimized for speed, but for accuracy, because it is not used very often normally. **************************************************************************** FUNCTION mftostr(VAR a : mfloat; { *** string <-- a *** } len : integer; format : string) : string; This procedure converts the multiprecision number a to a string str with a maximum number of ASCII - characters, which is defined in len. If the number of characters is too less, the string is set to stars '*'. The characters in the format - string have following meaning: '.' the point is shown ervery time, not significant zeros behind the point are shown 'E' scientific format with an exponent of ten 'F' fixpoint format, if the number is too large, a scientific format with an exponent of ten is used 'G' fixpoint format, if the number is too large, or less than 0.0001, a scientific format is used. '1..' an integer, which declares the number of digits behind the point for fixpoint and scientific format The order of the characters (but not the digits for the precision) in the format string is arbitrary. It is also allowed to use lower case characters for the format, but in this case the letter for the exponent of ten is also the lower case letter 'e'. For conversion the accuracy of the calculation is set to maximum and afterwards is set back to the original value. This function is not optimized for speed, but for accuracy, because it is not used very often normally. **************************************************************************** FUNCTION MfToD(VAR a : mfloat) : double; { *** MfToD <- a *** } This function converts a multiprecision number to a double precision number. The value of a is not changed. **************************************************************************** FUNCTION MfToLd(VAR a : mfloat) : extended; { *** MfToLd <- a *** } This function converts a multiprecision number to an extended number. The value of a is not changed. **************************************************************************** PROCEDURE DToMf(VAR a : mfloat; b : double); { *** a <- b *** } This procedure converts a double precision number to a multiprecision number. **************************************************************************** PROCEDURE LdToMf(VAR a : mfloat; b : extended);{ *** a <- b *** } This procedure converts an extended number to a multiprecision number. **************************************************************************** ++++++++++++++++ standart functions (Borland C: MATH.H) ++++++++++++++ **************************************************************************** PROCEDURE acosm(VAR a : mfloat); { *** a <- arccos(a) *** } This procedure calculates the invers cos(x) function of the argument a. The result is the variable a. -1 <= a <= 1 range of the argument, otherwise the error flag is set 0 <= a <= pi range of the result **************************************************************************** PROCEDURE asinm(VAR a : mfloat); { *** a <- arcsin(a) *** } This procedure calculates the invers sin(x) function of the argument a. The result is the variable a. -1 <= a <= 1 range of the argument, otherwise the error flag is set -pi/2 <= a <= pi/2 range of the result **************************************************************************** PROCEDURE atanm(VAR a : mfloat); { *** a <- arctan(a) *** } This procedure calculates the invers tan(x) function of the argument a. The result is the variable a. -pi/2 < a < pi/2 range of the result **************************************************************************** PROCEDURE atan2m(VAR a, b : mfloat); { *** a <- arg(b + ia) *** } This function calulates the extended arctan function phi = arctan(a/b). If a is zero or less than zero, the correct value phi is retured, which is in the intervall (-pi, pi]: a = r * sin(phi) b = r * cos(phi) r = sqrt(sqr(a) + sqr(b)) >= 0 If a = 0 and b = 0, the value phi is not unique. The error flag is set and the number zero is returned. The return variable is the variable a. Variable b is not changed. **************************************************************************** PROCEDURE ceilm(VAR a : mfloat); { *** a <-- ceil(a) *** } The least integer number, which is greater than or equal the mfloat number a, is calculated and returned to the variable a. **************************************************************************** PROCEDURE cosm(VAR a : mfloat); { *** a <- cos(a) *** } This procedure calculates the cos(x) function of the argument a. The result is the variable a. **************************************************************************** PROCEDURE coshm(VAR a : mfloat); { *** a <- cosh(a) *** } This procedure calculates the cosh(x) function of the argument a. The result is the variable a. -22712 < a < 22712 range of the argument without an overflow a >= 1 range of the result **************************************************************************** PROCEDURE expm(VAR a : mfloat); { *** a <- exp(a) *** } This procedure calculates exponential function exp(x) of the argument a. The result is the variable a. a < 22712 range of the argument without an overflow a < -22712 there is an underflow result = 0 a >= 0 range of the result **************************************************************************** PROCEDURE fabsm(VAR a : mfloat); { *** a <-- fabs(a) *** } This procedure returns the absolute value of the number a. The result is the variable a. **************************************************************************** PROCEDURE floorm(VAR a : mfloat); { *** a <-- floor(a) *** } The greatest integer number, which is less than or equal the mfloat number a, is calculated and returned to the variable a. **************************************************************************** PROCEDURE fmodm(VAR a, b : mfloat); { *** a <-- fmod(a,b) *** } This is a floating point modul function. The number a is divided by b, so that the result is an integer. The rest of this division is returned by the variable a. The variable b is not changed. a = k * b + c; k is an integer; 0 <= c / b < 1; c is assigned to a If the variable b is zero, the error flag is set. For negative b, the result is negativ or zero. For very large k, which can not be represented as an mfloat number exactly, the result is zero. **************************************************************************** PROCEDURE frexpm(VAR a : mfloat; { *** a <-- frexp(a,b) *** } VAR b : integer); The mfloat number a is splitted in the exponent and in the mantissa: a = c * (2**b); 0.5 <= |c| < 1 if a <> 0; c is assigned to a This procedure returns two results. IF the mfloat number is zero, it remains zero and b is set to zero. **************************************************************************** PROCEDURE hypotm(VAR a, b : mfloat); { *** a <-- hypot(a,b) *** } This function is used for the calculation of the absolute value of a complex number: c = sqrt(sqr(a) + sqr(b)); c is assigned to a; b remains unchanged. |a + i*b| = hypot(a,b); This procedure calculates the correct result also if sqr(a) can not be represented as a mfloat number, because it is too great or too small. **************************************************************************** PROCEDURE ldexpm(VAR a : mfloat; b : integer); { *** a <-- ldexp(a,b) *** } The mfloat number a is multiplied by the power of two 2**b. a <-- a * (2**b); The calculation is very fast because only the exponent of two is manipulated. **************************************************************************** PROCEDURE logm(VAR a : mfloat); { *** a <- log(a) *** } This procedure calculates the natural logarithm ln(x) of the argument a. The result is the variable a. a > 0 range of the argument -22712 <= a <= 22712 range of the result **************************************************************************** PROCEDURE log10m(VAR a : mfloat); { *** a <- log10(a) *** } This procedure calculates the decade logarithm log(x) of the argument a. The result is the variable a. a > 0 range of the argument -9864 <= a <= 9863 range of the result **************************************************************************** PROCEDURE modfm(VAR a, b : mfloat); { *** a, b <-- modf(a) *** } This procedure splits the number a in an integer part and in an fractional part: a = k + b; k is an integer; 0 <= b < 1; k is assigned to a **************************************************************************** PROCEDURE powm(VAR a, b : mfloat); { *** a <-- pow(a,b) *** } This procedure calculates the power of the basis a by the exponent b. The result is assigned to the variable a, the variable b is not changed. If the exponent is an integer, the calculation is made by repeated multiplications. The number of multiplications is proportional to log(|b|). If b has a fractional, the calculation is made by logarithm. In this case, the base a must be positiv, otherwise the error flag is set, because the result is not real. If the number b is very large so that no fractional can be represented by the mfloat format and the base is negativ, the error flag is set too. If a = 0 and b <= 0 the result is not defined and the error flag is set. **************************************************************************** PROCEDURE pow10m(VAR a : mfloat; b : integer); { *** a <-- pow10(a) *** } The power of the basis ten is build with the integer exponent b. The result is assigned to a. a <-- 10**b if b < -9864 then a = 0 if b > 9863 then there is an overflow -> error flag is set **************************************************************************** PROCEDURE sinm(VAR a : mfloat); { *** a <- sin(a) *** } This procedure calculates the sin(x) function of the argument a. The result is the variable a. **************************************************************************** PROCEDURE sinhm(VAR a : mfloat); { *** a <- sinh(a) *** } This procedure calculates the sinh(x) function of the argument a. The result is the variable a. -22712 < a < 22712 range of the argument without an overflow **************************************************************************** PROCEDURE sqrtm(VAR a : mfloat); { *** a <- sqrt(a) *** } This procedure calculates the square root of the number a. The result is variable a. If a is negativ, the error flag is set. **************************************************************************** PROCEDURE tanm(VAR a : mfloat); { *** a <- tan(a) *** } This procedure calculates the tan(x) function of the argument a. The result is the variable a. It is faster than the calculation of sin(a) / cos(a). **************************************************************************** PROCEDURE tanhm(VAR a : mfloat); { *** a <- tanh(a) *** } This procedure calculates the tanh(x) function of the argument a. The result is the variable a. -1 <= a <= 1 range of the result **************************************************************************** ++++++++++++++++++ extended standart functions ++++++++++++++++++++++++ **************************************************************************** PROCEDURE acoshm(VAR a : mfloat); { *** a <- arcosh(a) *** } This procedure calculates the invers cosh(x) function of the argument a. The result is the variable a. 1 <= a range of the argument 0 <= a < 22712 range of the result **************************************************************************** PROCEDURE acotm(VAR a : mfloat); { *** a <- arccot(a) *** } This procedure calculates the invers cot(x) function of the argument a. The result is the variable a. 0 < a < pi range of the result **************************************************************************** PROCEDURE acothm(VAR a : mfloat); { *** a <- arcoth(a) *** } This procedure calculates the invers coth(x) function of the argument a. The result is the variable a. -1 > a or 1 < a range of the argument **************************************************************************** PROCEDURE asinhm(VAR a : mfloat); { *** a <- arsinh(a) *** } This procedure calculates the invers sinh(x) function of the argument a. The result is the variable a. **************************************************************************** PROCEDURE atanhm(VAR a : mfloat); { *** a <- artanh(a) *** } This procedure calculates the invers tanh(x) function of the argument a. The result is the variable a. -1 < a < 1 range of the argument **************************************************************************** PROCEDURE cossinm(VAR a, b : mfloat); { *** a <-- cos(a), b <-- sin(a) *** } This procedure calculates the sin(x) and cos(x) function by only one evulation of the series for the sine function. This procedure is therefore faster than two individual calculations. It is recommended for example for the calculation of the exponential function of complex arguments. **************************************************************************** PROCEDURE cotm(VAR a : mfloat); { *** a <- cot(a) *** } This procedure calculates the cot(x) function of the argument a. The result is the variable a. It is faster than the calculation of cos(a) / sin(a). **************************************************************************** PROCEDURE cothm(VAR a : mfloat); { *** a <- coth(a) *** } This procedure calculates the coth(x) function of the argument a. The result is the variable a. range of the result: a >=1 or a <= -1 **************************************************************************** PROCEDURE exp10m(VAR a : mfloat); { *** a <- 10 ** a *** } This procedure calculates the power of 10 of the argument a. The result is the variable a. a < 9863 range of the argument without an overflow a < -9864 there is an underflow, result = 0 a >= 0 range of the result **************************************************************************** PROCEDURE sqrm(VAR a : mfloat); { *** a <- sqr(a) *** } This procedure calculates the sqare of a. The result is variabel a. sqr(a) = a * a **************************************************************************** PROCEDURE truncm(VAR a : mfloat); { *** a <-- trunc(a) *** } This procedure returns the integer part of the number a (truncation). The result is variable a. The sign has no influence and remains unchanged. =========================================================================== *************************************************************************** +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ *************************************************************************** =========================================================================== Examples with MFLOAT: a) The calculation of the number PI: There are many methods known to calculate the irrational number pi, which is the relation between the perimeter and the diameter of a circle. Pi = 16 * arctan(1 / 5) - 4 * arctan(1 / 239); Pi = 48 * arctan(1 / 18) + 32 * arctan(1 / 57) - 20 * arctan(1 / 239); Pi = 6 * arctan(1 / sqrt(3)); PI = 4 * (1 - 1/3 + 1/5 - 1/7 + ....) The function arctan(x) is calculated with the series: arctan(x) = x - x**3 / 3 + x**5 / 5 - x**7 / 7 + ...... The operation "x**.." means the power of x. The advantage of the first two equations is, that they converge very fast and every element of the series can be calculated recursivly from the former element by only two divisions by an integer. A division by an integer is very fast for long MFLOAT-numbers. The third equation is build by the known value : tan(pi/6) = 1 / sqrt(3). This method is slower than the former. The forth equation has only theoretical importance, in the numerical use this equation is very bad. The example program compares the different methods to calculate pi. b) The evulation of the bessel function for large arguments: There are some series known, where some elements are very large and the result is small. In this case numerical errors are very large. An example is the bessel function J0(x) for an argument x = 100.0. J0(x) = 1 - (x/2)**2 / (1! * 1!) + (x/2)**4 / (2! * 2!) - ...... The series converges very good, but there are problems with the accuracy of the calculation. The test program shows the influence of the accuracy of the calculation to the result. c) Conversion of cartesian coordinates to polar coordinates: This example shows the use of the functions hypot(x,y) and atan2(x,y). This program also demonstrates the input of MFLOAT-numbers. =========================================================================== *************************************************************************** +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ *************************************************************************** =========================================================================== Graz, 28.4.1993 Kaufmann Friedrich & Mueller Walter Students at the Technical University of Graz Registration address: Kaufmann Friedrich Schuetzenhofgasse 22 A-8010 GRAZ AUSTRIA EUROPE email address: fkauf@fstgds06.tu-graz.ac.at